This notebook presents an isobaric labeling data analysis strategy that includes data-driven normalization.
We will check how varying the analysis components (summarization, normalization, and differential abundance testing methods) changes the end results of a quantitative proteomic study.
Summarize quantification values from PSM to peptide (first step) to protein (second step).
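The two-step summarization can be sketched in base R as follows. This is a minimal illustration, not the notebook's own helper code: the data frame, its column names (`protein`, `peptide`, `ch126`, `ch127`) and the values are hypothetical, and the median is used as the summary function.

```r
# Hypothetical PSM-level table: one row per PSM, one column per quantification channel.
psm <- data.frame(
  protein = c("P1", "P1", "P1", "P2", "P2"),
  peptide = c("pepA", "pepA", "pepB", "pepC", "pepC"),
  ch126   = c(1.0, 1.2, 0.8, 2.0, 2.4),
  ch127   = c(0.9, 1.1, 1.0, 1.8, 2.2)
)

# Step 1: PSM -> peptide (median over all PSMs of the same peptide, per channel).
peptide_level <- aggregate(cbind(ch126, ch127) ~ protein + peptide,
                           data = psm, FUN = median)

# Step 2: peptide -> protein (median over all peptides of the same protein, per channel).
protein_level <- aggregate(cbind(ch126, ch127) ~ protein,
                           data = peptide_level, FUN = median)

print(protein_level)
```

Note that the formula interface of `aggregate` drops rows containing NA values by default.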
Notice that the row sums are not equal to Ncols anymore, because the median summarization does not preserve them (but mean summarization does).
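A small toy example may make this concrete. The matrix below is made up; each row is scaled to mean 1, so that every row sum equals the number of columns, and the rows are then summarized column-wise as if they belonged to one peptide.

```r
# Three hypothetical PSM rows of one peptide, four channels.
psm <- rbind(c(1, 2, 3, 10),
             c(4, 1, 1, 2),
             c(2, 2, 5, 3))

# Scale each row to mean 1, so that every row sum equals Ncols.
psm <- psm / rowMeans(psm)
n_cols <- ncol(psm)
print(rowSums(psm))

# Column-wise summarization of the three rows into a single row.
mean_summary   <- colMeans(psm)
median_summary <- apply(psm, 2, median)

# The mean is linear, so the summarized row still sums to Ncols.
print(sum(mean_summary))
# The median is not linear, so the row sum generally changes.
print(sum(median_summary))
```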
Let’s also summarize the non-normalized data for comparison in the next section.
Note that sum summarization is NOT equivalent to mean summarization: rows containing NA values are removed, but the number of PSMs per peptide and of peptides per protein varies, so a sum is not a constant multiple of the mean. Since we know that there is a strong peptide-run interaction, summing values across peptides per protein may result in a strong bias by run.
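The following toy example illustrates the point under a simplifying assumption: one protein whose peptides all have the same (made-up) intensity, but with three peptides observed in one run and only two in the other. The run names and values are hypothetical.

```r
# Hypothetical peptide-level values for a single protein in two runs.
pep <- data.frame(
  run       = c("run1", "run1", "run1", "run2", "run2"),
  intensity = c(1, 1, 1, 1, 1)
)

# Summing depends on how many peptides were observed in each run.
sum_summary  <- tapply(pep$intensity, pep$run, sum)
# Averaging does not.
mean_summary <- tapply(pep$intensity, pep$run, mean)

print(sum_summary)
print(mean_summary)
```

Here the sums differ between runs although every peptide has the same value, whereas the means agree.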
MA plots of two single samples taken from condition 1 and condition 0.125, measured in different MS runs (samples Mixture2_1:127C and Mixture1_2:129N, respectively).
MA plots of all samples from condition 1 and condition 0.125 (quantification values averaged within condition).
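For reference, the coordinates of an MA plot follow the standard definition: M is the log2 ratio of two intensity vectors and A is their average log2 intensity. The sketch below uses a hypothetical helper `ma_values` and made-up intensities; it is not the plotting code used for the figures above.

```r
# Compute MA-plot coordinates for two intensity vectors of equal length.
ma_values <- function(x, y) {
  lx <- log2(x)
  ly <- log2(y)
  data.frame(M = lx - ly,        # log2 fold change of x over y
             A = (lx + ly) / 2)  # average log2 intensity
}

# Made-up intensities for three proteins in two conditions.
cond_a <- c(8, 4, 2)
cond_b <- c(1, 4, 2)

ma <- ma_values(cond_a, cond_b)
print(ma)
```

If the condition labels denote relative spike-in amounts (an assumption), a spiked protein compared between conditions 1 and 0.125 would be expected near M = log2(1 / 0.125) = 3, while background proteins would scatter around M = 0.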
Before normalization, the median and iPQF plots differ noticeably, but after normalization they are quite similar!
Only use spiked proteins
TODO:
- !!! Only the first 1000 records are selected to speed up knitting; remove this in the final version !!!
- Also use short label names, as in the PCA plot.
- Unify the list of arguments across pcaplot.ils and dendrogram.ils. Make sure labeling and color picking are done in the same location (either inside or outside the function).

TODO:
- Also try to log-transform the intensity case, to see if there are large differences in the t-test results. Done; remove this code?

NOTE:
- lmFit (used in moderated_ttest) was built for log2-transformed data. However, supplying untransformed intensities can also work. This just means that the effects in the linear model are additive on the untransformed scale, whereas for log-transformed data they are multiplicative on the untransformed scale. There may also be a bias arising from biased estimates of the population means in the t-tests, as mean(X) is not equal to exp(mean(log(X))).
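The last remark can be checked numerically: the mean on the raw scale is the arithmetic mean, whereas back-transforming the mean of the logs gives the geometric mean, which is never larger. The intensities below are made up for illustration.

```r
# Hypothetical right-skewed intensities.
x <- c(1, 2, 4, 8, 100)

arithmetic_mean <- mean(x)            # mean on the untransformed scale
geometric_mean  <- exp(mean(log(x)))  # back-transformed mean of the logs

print(arithmetic_mean)
print(geometric_mean)

# A difference of log2 means equals the log2 ratio of geometric means,
# i.e. an additive effect on the log scale is multiplicative on the raw scale.
y <- 2 * x
print(mean(log2(y)) - mean(log2(x)))
```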
Confusion matrix:
Confusion matrix for variant: median

| | contrast | background | spiked |
|---|---|---|---|
| not DEA | 0.667 | 4064 | 0 |
| DEA | 0.667 | 15 | 4 |
| not DEA | 0.125 | 4061 | 3 |
| DEA | 0.125 | 5 | 14 |
| not DEA | 1 | 4059 | 5 |
| DEA | 1 | 5 | 14 |

| | 0.667 | 0.125 | 1 |
|---|---|---|---|
| Accuracy | 0.9963262 | 0.9980407 | 0.9975508 |
| Sensitivity | 0.2105263 | 0.7368421 | 0.7368421 |
| Specificity | 1.0000000 | 0.9992618 | 0.9987697 |
| PPV | 1.0000000 | 0.8235294 | 0.7368421 |
| NPV | 0.9963226 | 0.9987703 | 0.9987697 |

| | contrast | background | spiked |
|---|---|---|---|
| not DEA | 0.667 | 4063 | 1 |
| DEA | 0.667 | 16 | 3 |
| not DEA | 0.125 | 4061 | 3 |
| DEA | 0.125 | 5 | 14 |
| not DEA | 1 | 4060 | 4 |
| DEA | 1 | 5 | 14 |

| | 0.667 | 0.125 | 1 |
|---|---|---|---|
| Accuracy | 0.9958364 | 0.9980407 | 0.9977957 |
| Sensitivity | 0.1578947 | 0.7368421 | 0.7368421 |
| Specificity | 0.9997539 | 0.9992618 | 0.9990157 |
| PPV | 0.7500000 | 0.8235294 | 0.7777778 |
| NPV | 0.9960775 | 0.9987703 | 0.9987700 |

| | contrast | background | spiked |
|---|---|---|---|
| not DEA | 0.667 | 4064 | 0 |
| DEA | 0.667 | 19 | 0 |
| not DEA | 0.125 | 4064 | 0 |
| DEA | 0.125 | 16 | 3 |
| not DEA | 1 | 4064 | 0 |
| DEA | 1 | 16 | 3 |

| | 0.667 | 0.125 | 1 |
|---|---|---|---|
| Accuracy | 0.9953466 | 0.9960813 | 0.9960813 |
| Sensitivity | 0.0000000 | 0.1578947 | 0.1578947 |
| Specificity | 1.0000000 | 1.0000000 | 1.0000000 |
| PPV | NaN | 1.0000000 | 1.0000000 |
| NPV | 0.9953466 | 0.9960784 | 0.9960784 |
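
The reported metrics can be reproduced from the counts with a few lines of base R. This sketch assumes a particular reading of the tables: the "DEA" row holds the truly spiked proteins and the "spiked" column the proteins called significant. That reading is an interpretation, supported by the row totals (4064 and 19) being constant across contrasts. The helper `binary_metrics` is hypothetical and not part of the notebook.

```r
# Standard binary classification metrics from the four cells of a confusion matrix.
binary_metrics <- function(tp, fn, fp, tn) {
  c(Accuracy    = (tp + tn) / (tp + fn + fp + tn),
    Sensitivity = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    PPV         = tp / (tp + fp),
    NPV         = tn / (tn + fn))
}

# Median variant, contrast 0.667: counts copied from the first table.
median_0667 <- binary_metrics(tp = 4, fn = 15, fp = 0, tn = 4064)
print(round(median_0667, 7))

# Third table, contrast 0.667: no protein is called significant,
# so PPV is 0 / 0, which R evaluates to NaN.
third_0667 <- binary_metrics(tp = 0, fn = 19, fp = 0, tn = 4064)
print(round(third_0667, 7))
```

The NaN in the PPV row of the third table therefore reflects a contrast with no positive calls rather than a computational error.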
Scatter plots:
Volcano plots:
Violin plots:
Let’s see whether the spiked protein fold changes make sense.
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_BE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_BE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_BE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_BE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] dendextend_1.14.0 CONSTANd_0.99.0 forcats_0.5.0
## [4] stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [7] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4
## [10] tidyverse_1.3.0 MSnbase_2.15.7 ProtGenerics_1.21.0
## [13] S4Vectors_0.27.14 mzR_2.23.1 Rcpp_1.0.5
## [16] Biobase_2.49.1 BiocGenerics_0.35.4 kableExtra_1.3.1
## [19] psych_2.0.9 gridExtra_2.3 RColorBrewer_1.1-2
## [22] stringi_1.5.3 limma_3.45.19 caret_6.0-86
## [25] ggplot2_3.3.2 lattice_0.20-41
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ellipsis_0.3.1 class_7.3-17
## [4] fs_1.5.0 rstudioapi_0.11 farver_2.0.3
## [7] affyio_1.59.0 prodlim_2019.11.13 fansi_0.4.1
## [10] lubridate_1.7.9 xml2_1.3.2 codetools_0.2-16
## [13] splines_4.0.3 ncdf4_1.17 mnormt_2.0.2
## [16] doParallel_1.0.16 impute_1.63.0 knitr_1.30
## [19] jsonlite_1.7.1 pROC_1.16.2 broom_0.7.2
## [22] vsn_3.57.0 dbplyr_1.4.4 BiocManager_1.30.10
## [25] compiler_4.0.3 httr_1.4.2 backports_1.1.10
## [28] assertthat_0.2.1 Matrix_1.2-18 cli_2.1.0
## [31] htmltools_0.5.0 tools_4.0.3 gtable_0.3.0
## [34] glue_1.4.2 affy_1.67.1 reshape2_1.4.4
## [37] MALDIquant_1.19.3 cellranger_1.1.0 vctrs_0.3.4
## [40] preprocessCore_1.51.0 nlme_3.1-150 iterators_1.0.13
## [43] timeDate_3043.102 gower_0.2.2 xfun_0.18
## [46] rvest_0.3.6 lifecycle_0.2.0 XML_3.99-0.5
## [49] zlibbioc_1.35.0 MASS_7.3-53 scales_1.1.1
## [52] ipred_0.9-9 pcaMethods_1.81.0 hms_0.5.3
## [55] yaml_2.2.1 rpart_4.1-15 highr_0.8
## [58] foreach_1.5.1 e1071_1.7-4 BiocParallel_1.23.3
## [61] lava_1.6.8 rlang_0.4.8 pkgconfig_2.0.3
## [64] mzID_1.27.0 evaluate_0.14 labeling_0.4.2
## [67] recipes_0.1.14 tidyselect_1.1.0 plyr_1.8.6
## [70] magrittr_1.5 R6_2.4.1 IRanges_2.23.10
## [73] generics_0.0.2 DBI_1.1.0 mgcv_1.8-33
## [76] pillar_1.4.6 haven_2.3.1 withr_2.3.0
## [79] survival_3.2-7 nnet_7.3-14 modelr_0.1.8
## [82] crayon_1.3.4 tmvnsim_1.0-2 rmarkdown_2.5
## [85] viridis_0.5.1 grid_4.0.3 readxl_1.3.1
## [88] data.table_1.13.2 blob_1.2.1 ModelMetrics_1.2.2.2
## [91] reprex_0.3.0 digest_0.6.27 webshot_0.5.2
## [94] munsell_0.5.0 viridisLite_0.3.0